
Improve custom transpilation for faster 1Q/2Q RB #922

Merged

Conversation

itoko (Contributor) commented Sep 27, 2022

Summary

This PR is a follow-up to #892 that further improves performance and applies several clean-ups suggested in the review of PR 892.

Details and comments

The drastic performance improvement in 1Q/2Q RB (PR 892) came mostly from two ideas:

  1. Custom transpilation (mapping circuits to physical qubits without using transpile)
  2. Integer-based Clifford operations (especially sparse lookup table with triplet decomposition for 2Q Clifford circuits)
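For illustration, here is a minimal sketch of idea (1): mapping a small circuit onto the chosen physical qubits by rebuilding it on a register of the right width, instead of invoking the full transpile() pipeline. The helper name and structure below are assumptions made for this sketch, not the code added in this PR.

```python
from qiskit import QuantumCircuit, QuantumRegister


def map_to_physical_qubits(circuit: QuantumCircuit, physical_qubits):
    """Relabel the qubits of a 1Q/2Q circuit to the given physical qubit indices."""
    qreg = QuantumRegister(max(physical_qubits) + 1, "q")
    mapped = QuantumCircuit(qreg, *circuit.cregs, name=circuit.name)
    qubit_map = {
        circuit.qubits[i]: qreg[physical_qubits[i]] for i in range(circuit.num_qubits)
    }
    # Re-append every instruction on the remapped qubits; no layout, routing, or
    # optimization passes are run, which is what makes this fast.
    for inst, qargs, cargs in circuit.data:
        mapped.append(inst, [qubit_map[q] for q in qargs], cargs)
    return mapped


# e.g. map_to_physical_qubits(clifford_circ, physical_qubits=(3, 5))
```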

PR 892 implements more than those two ideas (including some changes that do not contribute to the performance improvement), which makes it difficult to track what actually improved the performance and unnecessarily reduces the maintainability of the code.

This PR cleans up the implementation of Standard/InterleavedRB without performance regression and adds further optimization to (1) the custom transpilation, following several suggestions given in PR 892. For instance:

  • Change back to the simple code structure introduced at Refactor RB module for future extensions #898
  • Change back to returning wrapped Clifford circuits (not unwrapped ones); such a change (wrapped -> unwrapped) should be discussed as an independent issue
  • Move StandardRB._layout_for_rb to clifford_utils._transpile_clifford_circuit and add further optimization.
    (See also the original draft PR to roughly understand which code blocks are essentially changed by the above updates)
  • Fix a bug that ignores user calibrations in custom transpilation of 1Q/2Q RB circuits

Other features added in this PR include:

This PR achieves further speed-up (1.2x for 1Q StandardRB, 1.1x for 2Q StandardRB, 3.4x for 1Q InterleavedRB, 1.6x for 2Q InterleavedRB) measured by this benchmark.
This PR will be followed by one more PR that cleans up clifford_utils, which is left with minimal updates in this PR.


# The following methods are used for RB with more than 2 qubits
def _sample_circuits(self):
    self._cliff_utils = CliffordUtils(
Collaborator:

Can this be promoted to an actual generator by including _sample_sequences and _sequences_to_circuits? I am implementing another RB variant that partly uses Clifford sequences, but its final structure is totally different from standard RB (thus no inheritance). I don't want to implement boilerplate code in such a class.

Contributor Author (itoko):

I see your point, but let me get there (having an RBCircuitGenerator class to delegate the whole circuit generation) step by step. I'd like to focus on cleaning up the Standard/InterleavedRB implementation in this PR, then clean up clifford_utils in the next PR (I'll remove the creation of CliffordUtils objects there). I'd prefer that further refactoring come after those.
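For context, a purely hypothetical sketch of the kind of generator delegation discussed above; none of these names exist in this PR, which deliberately defers the refactoring.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

from qiskit import QuantumCircuit


class RBCircuitGenerator(ABC):
    """Hypothetical delegate that owns sequence sampling and circuit construction."""

    @abstractmethod
    def sample_sequences(self) -> List[Sequence]:
        """Sample random element sequences (e.g. Clifford indices)."""

    @abstractmethod
    def sequences_to_circuits(self, sequences: List[Sequence]) -> List[QuantumCircuit]:
        """Convert the sampled sequences into experiment circuits."""
```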

if self.num_qubits in [1, 2]:
    transpiled = self._layout_for_rb()
    has_custom_transpile_option = (
        any(
Collaborator:

I realized the current implementation induces a serious issue when benchmarking a user-calibrated gate. The conventional sequence implicitly triggers the pulse gate pass in the default pass managers, which pulls gate calibrations from the backend inst_map even though it doesn't explicitly show up in the transpile options. This is indeed a breaking API change, because custom calibrations will simply be ignored and the user will never notice. I think you need to explicitly manage the inst_map without using a pass manager (for performance reasons). Alternatively, we can assume a V2 backend, where calibration data is always provided with the gate entry. Note that this PR will be merged soon. This makes me think we need a unit test for missing calibrations.
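For illustration only, a minimal sketch of how custom calibrations from an instruction schedule map could be attached during custom transpilation; this is a hypothetical helper, not the actual change made in cee4e78.

```python
def attach_calibrations(transpiled_circ, inst_map):
    """Copy pulse calibrations from inst_map onto a circuit already mapped to physical qubits."""
    for inst, qargs, _ in transpiled_circ.data:
        qubits = tuple(transpiled_circ.find_bit(q).index for q in qargs)
        if inst_map.has(inst.name, qubits):
            # Register the (possibly user-supplied) schedule on the circuit so the
            # backend uses the custom calibration instead of its default.
            schedule = inst_map.get(inst.name, qubits, *inst.params)
            transpiled_circ.add_calibration(inst.name, qubits, schedule, inst.params)
    return transpiled_circ


# e.g. attach_calibrations(circ, backend.defaults().instruction_schedule_map)
```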

Contributor Author (itoko):

Addressed in cee4e78

Collaborator:

I prefer to have a warning or to completely remove transpile options, i.e., overriding the _set_transpile_options method to raise a warning/error. Usually users are not aware of what is happening inside .run() and may set some random option copied from example code. I think accepting only inst_map makes sense for this experiment.

Contributor Author (itoko):

Good suggestion. I like the idea of restricting the acceptable transpile options for each experiment (too many possible transpile_options was one of my headaches). For RB experiments, I'd rather not, but I'll accept basis_gates and optimization_level for now as well as inst_map, because they are widely used in the tests (fortunately, they don't seem to be used in the tutorial). Forbidding them would require a large refactoring of the tests, so if we want that, I think it should be done in a separate PR.
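A minimal sketch of what such a restriction could look like (the constant and function names are assumptions for this sketch, not code from this PR):

```python
from qiskit.exceptions import QiskitError

_ACCEPTED_RB_TRANSPILE_OPTIONS = {"basis_gates", "optimization_level", "inst_map"}


def validate_rb_transpile_options(**fields):
    """Reject transpile options that the custom 1Q/2Q RB transpilation would silently ignore."""
    rejected = set(fields) - _ACCEPTED_RB_TRANSPILE_OPTIONS
    if rejected:
        raise QiskitError(
            f"Transpile options {sorted(rejected)} are not supported by 1Q/2Q RB; "
            f"only {sorted(_ACCEPTED_RB_TRANSPILE_OPTIONS)} are accepted."
        )


# e.g. validate_rb_transpile_options(optimization_level=1, seed_transpiler=42) raises
```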

itoko marked this pull request as ready for review on September 30, 2022 11:00
itoko (Contributor Author) commented Oct 3, 2022

This PR is now ready to be reviewed.

merav-aharoni (Contributor) left a comment:

I am submitting the comments I have up to here, so that you can have a look at them. I will continue to review, probably only on Thursday, because of our holiday.

nkanazawa1989 (Collaborator) left a comment:

Thanks @itoko san. Looks almost good to me. Mainly nitpicks.

transpiled = super()._transpiled_circuits()
if self.analysis.options.get("gate_error_ratio", None) is None:
    ...
else:
    transpiled = [
Collaborator:

Do you think it's worth having multi-threading here? Note that the standard Qiskit transpiler uses Qiskit's parallel utilities, while this is single-threaded.

Contributor Author (itoko):

Yeah, I think it's worth investigating.

Contributor Author (itoko):

I've examined parallel transpilation using qiskit.tools.parallel.parallel_map and found that it's much slower than the serial one. I guess that is due to the overhead of pickling circuits. Let me keep the current code as-is.

Examined code for parallel transpilation:

    import os

    from qiskit.tools import parallel
    from .clifford_utils import _transpile_clifford_circuit

    circuits = self.circuits()
    if (
        len(circuits) > 1
        and os.getenv("QISKIT_IN_PARALLEL", "FALSE") == "FALSE"
        and parallel.PARALLEL_DEFAULT
    ):
        # Custom transpilation in parallel
        transpiled = parallel.parallel_map(
            _transpile_clifford_circuit,
            circuits,
            task_kwargs={"physical_qubits": self.physical_qubits},
        )

Collaborator:

This is good to know.

Contributor Author (itoko):

Or maybe it's because I'm using a Mac (I heard somewhere that parallel_map does multiprocessing, which is slow on Mac).

Contributor:

On macOS, parallel_map won't actually run in parallel by default on Python >= 3.8 (https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/tools/parallel.py#L61-L83 defines the defaults), because Python 3.8 switched macOS multiprocessing to use spawn instead of fork by default due to some platform bugs. Spawn has a lot of extra overhead, and it's also unreliable for user code because you often need an if __name__ == "__main__" guard to distinguish a child process from the parent, so we disable multiprocessing by default on macOS.

In general, multiprocessing in Python can be very tricky to work with and get a performance boost from. That said, for transpile() on Linux (so using fork()) with an involved compilation, the performance is typically better even with the pickling and IPC overhead of separate processes (with some caveats, see Qiskit/qiskit#7741).
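As a small self-contained illustration of the spawn behavior described above (unrelated to the RB code itself): with the spawn start method each worker re-imports the calling module, so unguarded top-level code would run again in every child, which is why the __main__ guard is needed.

```python
import multiprocessing as mp


def square(x):
    return x * x


if __name__ == "__main__":
    # Mimic the macOS default on Python >= 3.8.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        print(pool.map(square, range(4)))  # [0, 1, 4, 9]
```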

interleaved_circ.name = f"Clifford-{interleaved_circ.name}"
# Transpile circuit with non-basis gates and remove idling qubits
try:
    interleaved_circ = transpile(
Collaborator:

Do you need a full transpile here? Likely you just need to call the basis translator.

Contributor Author (itoko):

Good point. I think it was for delays, but it may be unnecessary now since I updated the handling of delays in 36193c9.
Let me check if we can replace transpile with basis translation.

Contributor Author (itoko), Oct 7, 2022:

I found it's possible, but it would be difficult to explain the spec in the docs (since we'd need to mention a protected method, _transform_clifford_circuit). Now I think we should revisit this when we redesign the basis translation scheme (removing the use of transpile for 1Q/2Q RB) in the future.
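For reference, a minimal sketch of the suggestion above: running only the BasisTranslator pass instead of a full transpile(). This is illustrative only and not what the PR ended up doing.

```python
from qiskit.circuit.equivalence_library import SessionEquivalenceLibrary as sel
from qiskit.transpiler import PassManager
from qiskit.transpiler.passes import BasisTranslator


def translate_to_basis(circuit, basis_gates):
    """Rewrite non-basis gates into the target basis without layout, routing, or optimization."""
    return PassManager([BasisTranslator(sel, basis_gates)]).run(circuit)


# e.g. translate_to_basis(interleaved_circ, ["rz", "sx", "x", "cx"])
```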

merav-aharoni (Contributor) left a comment:

I completed the review. Overall, I think the code looks very nice. I also like the split you made in test_randomized_benchmarking.

merav-aharoni (Contributor):
Regarding the optimization_level, when I change it to 0 in all the internal calls to transpile, I get the following error in many of the tests:

qiskit.exceptions.QiskitError: 'wrong param 6.28318530717959 for rz in clifford'

I am not sure exactly where this comes from, but it seems like level 0 does not do a good enough job at transpiling the Cliffords.

itoko (Contributor Author) commented Oct 12, 2022

Benchmarking with https://gist.github.com/itoko/f078218ea6458f32d1c0f9be827f614f

main: Main branch (0359f4c)
prev: Before this PR (0b15ccb)
this: After this PR (2052eac)

method = time_transpiled_circuits for both settings

Standard RB 1Q:

| backend | max_length | num_samples | main[s] | prev[s] | this[s] | prev->this |
|---|---|---|---|---|---|---|
| fake_manila | 1000 | 3 | 6.858569 | 0.851461 | 0.817444 | 0.960049 |
| fake_manila | 1000 | 6 | 12.919110 | 1.660500 | 1.478235 | 0.890235 |
| fake_manila | 2000 | 3 | 12.545703 | 1.642936 | 1.431839 | 0.871513 |
| fake_manila | 2000 | 6 | 25.421245 | 3.265286 | 2.816278 | 0.862491 |
| fake_mumbai | 1000 | 3 | 6.696955 | 0.868488 | 0.767947 | 0.884235 |
| fake_mumbai | 1000 | 6 | 13.016662 | 1.817998 | 1.430544 | 0.786879 |
| fake_mumbai | 2000 | 3 | 12.516374 | 1.767419 | 1.416315 | 0.801346 |
| fake_mumbai | 2000 | 6 | 24.813701 | 3.380879 | 2.778376 | 0.821791 |
| fake_washington | 1000 | 3 | 10.513716 | 0.965256 | 0.795942 | 0.824592 |
| fake_washington | 1000 | 6 | 13.176188 | 1.944821 | 1.458366 | 0.749872 |
| fake_washington | 2000 | 3 | 12.700285 | 1.831261 | 1.451279 | 0.792503 |
| fake_washington | 2000 | 6 | 25.937748 | 3.700574 | 2.786020 | 0.752862 |

Standard RB 2Q:

| backend | max_length | num_samples | main[s] | prev[s] | this[s] | prev->this |
|---|---|---|---|---|---|---|
| fake_manila | 100 | 3 | 2.174022 | 0.373098 | 0.565202 | 1.514887 |
| fake_manila | 100 | 6 | 3.476085 | 0.743208 | 1.428178 | 1.921639 |
| fake_manila | 200 | 3 | 3.023454 | 0.726553 | 0.657691 | 0.905221 |
| fake_manila | 200 | 6 | 6.317040 | 1.353940 | 0.987601 | 0.729428 |
| fake_mumbai | 100 | 3 | 2.308012 | 0.362461 | 0.500592 | 1.381091 |
| fake_mumbai | 100 | 6 | 3.734278 | 0.707195 | 0.652423 | 0.922552 |
| fake_mumbai | 200 | 3 | 3.167860 | 0.684783 | 0.625557 | 0.913510 |
| fake_mumbai | 200 | 6 | 6.480515 | 1.380438 | 0.975742 | 0.706835 |
| fake_washington | 100 | 3 | 6.012488 | 0.407936 | 0.482093 | 1.181785 |
| fake_washington | 100 | 6 | 3.716712 | 0.836933 | 0.638941 | 0.763431 |
| fake_washington | 200 | 3 | 3.445951 | 0.733632 | 0.680576 | 0.927680 |
| fake_washington | 200 | 6 | 6.524212 | 1.499667 | 1.030386 | 0.687077 |

Interleaved RB 1Q:

| backend | max_length | num_samples | main[s] | prev[s] | this[s] | prev->this |
|---|---|---|---|---|---|---|
| fake_manila | 1000 | 3 | 18.058308 | 4.673694 | 1.485711 | 0.317888 |
| fake_manila | 1000 | 6 | 35.537921 | 9.215669 | 2.770748 | 0.300656 |
| fake_manila | 2000 | 3 | 35.208998 | 9.280712 | 2.657422 | 0.286338 |
| fake_manila | 2000 | 6 | 71.094951 | 18.541894 | 5.288353 | 0.285211 |
| fake_mumbai | 1000 | 3 | 17.977917 | 4.591728 | 1.393617 | 0.303506 |
| fake_mumbai | 1000 | 6 | 35.188215 | 9.284733 | 2.724183 | 0.293405 |
| fake_mumbai | 2000 | 3 | 34.964976 | 8.912482 | 2.702437 | 0.303219 |
| fake_mumbai | 2000 | 6 | 71.040783 | 18.087106 | 5.337821 | 0.295117 |
| fake_washington | 1000 | 3 | 22.380718 | 5.125869 | 1.415023 | 0.276055 |
| fake_washington | 1000 | 6 | 36.875958 | 10.234201 | 2.752939 | 0.268994 |
| fake_washington | 2000 | 3 | 36.186682 | 10.308951 | 2.701694 | 0.262073 |
| fake_washington | 2000 | 6 | 71.671835 | 19.716134 | 5.364398 | 0.272082 |

Interleaved RB 2Q:

| backend | max_length | num_samples | main[s] | prev[s] | this[s] | prev->this |
|---|---|---|---|---|---|---|
| fake_manila | 100 | 3 | 3.804730 | 0.663383 | 0.596649 | 0.899403 |
| fake_manila | 100 | 6 | 7.130793 | 1.335523 | 0.860453 | 0.644282 |
| fake_manila | 200 | 3 | 6.238839 | 1.277523 | 0.833921 | 0.652764 |
| fake_manila | 200 | 6 | 12.598263 | 2.683764 | 1.338446 | 0.498720 |
| fake_mumbai | 100 | 3 | 3.748988 | 0.694320 | 0.561316 | 0.808441 |
| fake_mumbai | 100 | 6 | 7.109178 | 1.358674 | 0.890352 | 0.655310 |
| fake_mumbai | 200 | 3 | 6.365602 | 1.334885 | 0.861731 | 0.645547 |
| fake_mumbai | 200 | 6 | 12.608916 | 2.633852 | 1.461120 | 0.554746 |
| fake_washington | 100 | 3 | 7.953887 | 0.819290 | 0.599565 | 0.731811 |
| fake_washington | 100 | 6 | 7.819094 | 1.608669 | 0.925331 | 0.575215 |
| fake_washington | 200 | 3 | 6.671117 | 1.493850 | 0.825585 | 0.552656 |
| fake_washington | 200 | 6 | 13.213095 | 2.938669 | 1.373048 | 0.467235 |

nkanazawa1989 (Collaborator) left a comment:

Approved with minor suggestions. They don't block merging.

circ = QuantumCircuit(self.num_qubits)
circ.barrier(qubits)
circ.append(Barrier(self.num_qubits), circ.qubits)
Collaborator:

This comment is not addressed. Of course you can do this in the follow-up.

itoko force-pushed the feature-rb-custom-transpile branch from de2de95 to 9142df4 on October 12, 2022 08:19
itoko merged commit 7dfdcbb into qiskit-community:feature/rb_speedup on Oct 12, 2022
itoko mentioned this pull request on Nov 29, 2022
itoko added a commit to itoko/qiskit-experiments that referenced this pull request Nov 29, 2022
* Improve custom transpilation for faster 1Q/2Q RB with simplifying code structure

* Avoid using general Gate objects

* Fix handling of interleaved delays

* Add custom calibrations support in transpiling circuits

* Add more comments

* Small cleanups

* More precise retrieve of basis gates

* Update qiskit_experiments/library/randomized_benchmarking/rb_experiment.py

Co-authored-by: Naoki Kanazawa <nkanazawa1989@gmail.com>

* Renames and updates following review comments

* Remove experiment_type from circuit metadata

* Add backend V1 to V2 conversion

* Fix docs following review comments

* Separate a code block as a private method

* Change to create BackendTiming object outside loop

Co-authored-by: Naoki Kanazawa <nkanazawa1989@gmail.com>
itoko added a commit that referenced this pull request Dec 23, 2022
Improve the performance of transpiled circuit generation in 1Q/2Q StandardRB/InterleavedRB (about 10x speedup in 1Q SRB/IRB and 5x speedup in 2Q SRB/IRB in most cases).

Including following feature pull-requests:
* New algorithm for RB building Cliffords by layers (#892)
* Improve custom transpilation for faster 1Q/2Q RB (#922)
* Improve integer-based Clifford operations for 1Q/2Q RB (#940)
* Readd support for backends with directed basis gates to RB experiments (#960)

Features other than speedup:
* Fix performance regression in 3 or more qubits RB (embedded at Refactor RB module for future extensions #898)
* Improve validation of interleave_element in InterleavedRB
* Restructure tests for RB module
* Remove the barriers in front of the first Cliffords in RB circuits

Co-authored-by: merav-aharoni <46567124+merav-aharoni@users.noreply.github.com>
Co-authored-by: Naoki Kanazawa <nkanazawa1989@gmail.com>